CPSC 545/445 (Autumn 2003) - Class 19: Gene Expression Analysis & Regulatory Network Inference Module 6, Part 3 ---- brief recap: - clustering methods for gene expression data -> many general clustering + classification methods, in many cases based on local search - clustering is (to some extent) an art - no universal, agreed-upon criteria for eval of solutions - no ultimate alg (problem too general; but true even for gene expression) - further improvement of alg needs more data (real + synthetic) [from conclusions of Shamir and Sharan] --- 6.3 cancer diagnosis: algorithms for class prediction and class discovery Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286, 531-537, 1999. [reading assignment, not covered in class] --- 6.4 Inference of Regulatory Relationships from Gene Expression Data background: - many known regulators control transcription of mRNA - identification of regulator/regulatee pairs forms basis for inference of regulatory networks - here: detect pairs based on microarray time series data = gene expression level profiles obtained from multiple microarray experiments [slide] types of regulation between two genes X,Y: activation vs. inhibition: activation: X+ -> Y+ inhibition: X+ -> Y- direction/causation: X regulates Y != Y regulates X intuitively: X regulates Y -> features in X preceeds corresponding feaures in Y (cause and effect) direct vs. indirect: X -> Y (no intermediaries) X -> Z -> Y X -> Z_1 -> .... Z_k -> Y unconditional vs. conditional: X -> Y (always) X -> Y in the presence/absence of Z X -> Y in the presence of Z_1, ..., Z_k and absence of W_1, ..., W_m three approaches: - correlation analysis: use Pearson correlation coefficient r(X,Y) = ss_xy / sqrt(ss_xx*ss_yy), where ss_xx = sum_i[(x_i-avg(x))^2] ss_yy = sum_i[(y_i-avg(y))^2] s_xy = sum_i[(x_i-avg(x))(y_i-avg(y))] as indicator of similarity of expression profiles X,Y correlation coefficient measures strength of linear dependence between X,Y; -1 or 1 = perfect association; 0 = no association [activation -> r(X,Y) = ? same for inhibition?] widely used for analysis of gene expression profiles problems/weaknesses: - good for detecting global similarity, but not local similarity (which may be due to conditional regulatory relationships) - doesn't take into account temporal aspects -> only non-directional predictions doesn't account for consistent delays (e.g., due to indirect regulation) - edge detection method by Filkov et al. (2001): basic idea: improve detection of local similarity by focussing on major changes in gene expression level (edges) high level algorithm outline: - determine major edges in each profile X,Y - remove spurious edges - compute score from pairs of edges in X,Y that have the same direction (rising/falling) and are temporally close note: designed to deal with noise, time delays, local similarity; gives directional predictions [slide] - event method by Kwon et al. (2003): basic idea: study relationship between major features in profiles high level algorithm outline: - abstract profiles into strings of (time stamped) events use smoothing and filtering to remove noise - align event strings (using variant of Needleman-Wunsch alg for global seq alignment) to obtain score scoring matrix: R C F R S(dT) 0 -beta*S(dT) C 0 0 0 F -beta*S(dT) 0 alpha*S(dT) dT = time difference between aligned events note: designed to deal with noise, time delays, local similarity, known assymmetries between increases and decreases in expression level gives directional predictions results: - biological data: - yeast cell cycle data. Spellman et al., 1998 contains profiles for all ORFs in yeast (> 6000) (problem: very limited temporal resolution - 16 time points; inaccuracies due to problems in cell cycle synchronisation) reduced to 458 genes involved in 888 known regulations (note: > 200 000 directional candidate interactions) - compare rankings of candidate pairs based on correlation, edge, event scores - result: - number of true positives in top-k predictions comparable between all three methods - very little overlap between true positive predictions of different methods -> methods complement each other! - artificial data: - allows to abstract from weaknesses in real data, specifically test dependence of performance on individual factors (e.g., amount of delay, degree of similarity, etc.) -> event significantly superior to correlation in most cases (especially for data with large delays, only local similarity) (edge couldn't be tested because of many 0 scores - no major edges detected) [slides: andrew's thesis, 13,14,33,39,40,45,47,50] -- piece together pairwise regulations to obtain full network (many complications in the details, work in progress in BETA Lab @ UBC/CS) -- different approach: Cheng et al. (2000) identify entire network by - building graph of good candidate pairs (from pairwise analysis) - find subset of edges corresponding to true regulations using combinatorial optimisation methods and optimisation objective that models assumptions about general (local) structure of network (e.g., all regulatees have at least one activator and one inhibitor) (hard problem -> use SLS methods, here simulated annealing, to solve) -- similarly, Baysian network learning methods have been applied to the genetic regulatory network inference problem (Friedman et al., 2000) --- Resources: * Golub et al. Molecular Classfication of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286, 531-537, 1999. * Andrew T. Kwon, Holger H. Hoos, and Raymond Ng: Inference of transcriptional regulation relationships from gene expression data. Bioinformatics 19:905-912, 2003. Further Reading: Ben-Dor et al. Tissue Classification with Gene Expression Profiles, Recomb 2000. M.J. van der Laan and J. Bryan. Gene Expression Analysis with the Parameter Bootrap, Biostatistics, 2001 Kwon,A. (2002) Inference of transcriptional regulation relationships with gene expression data, Master’s thesis, Department of Computer Science, UBC, 2002. ---